Our study provides a thorough analysis of homicide report data in order to better understand the context, trends, and underlying causes of homicide in a particular location or across regions. The analysis investigates a number of homicide-related issues, including victim characteristics, crime scenes, suspects' motives, and the influence of societal, economic, and environmental factors.
Homicide is a serious social problem with ramifications across society: it affects not only the criminal justice system but also, critically, social and public health. Our study looks deeply into homicide report data, which normally contains details about each case, including the date, time, location, victim characteristics, suspect information, cause of death, and, in some circumstances, the crime's motive.
FILL IN HERE "Introduction (6 points)
An introduction that sets the stage for your analysis, describes the questions you attempted to answer at a high level, explains why the answers to the question matter, and provides a detailed description of the data set(s) and how they were acquired. Do not just copy the introduction from your proposal/project update. The final report introduction must be based on the report you are presenting."
FILL IN HERE "Choice for Heavier Grading on Data Processing or Data Analysis (1 point)
A description of whether your project should be graded more heavily on data processing or data analysis. You must choose one option or the other. If you select data processing, clearly describe why you believe the work you did goes above and beyond basic data processing needed for most data sets. If you select data analysis, clearly describe why you believe the work you did goes above and beyond basic data analysis needed to answer your questions."
We start by importing data from Kaggle, carefully reviewing a CSV file to make sure it is appropriate for our analysis environment. We then turn to data cleansing and missing-data management: a comprehensive review that includes finding and addressing missing or unnecessary information as well as removing duplicate records. We perform basic evaluations to understand data distributions, identify outliers, and display patterns that serve as a basis for more in-depth insights before transitioning into exploratory data analysis. Using the analytical features of libraries such as Pandas and NumPy, we investigate victim age, sex, weapon type, offender sex, and victim race. Age distributions are plotted, and filtering procedures reveal patterns in gender, weapon use, and offender demographics. We use Matplotlib's robust data visualization features to clarify the weapon and victim-race distributions in the dataset.
# Import all required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from numpy import nan as NA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
import plotly.express as px
import scipy
import plotly.graph_objects as go
# Load the homicide CSV file into a dataframe for processing and analysis
path = ''
homicides_df = pd.read_csv(path + 'database_Homicide.csv', dtype = { 16: 'str' })
As the Perpetrator Age column contains values of multiple datatypes, we import it as a string.
# Review dataset
homicides_df.head(10)
| Record ID | Agency Code | Agency Name | Agency Type | City | State | Year | Month | Incident | Crime Type | ... | Victim Ethnicity | Perpetrator Sex | Perpetrator Age | Perpetrator Race | Perpetrator Ethnicity | Relationship | Weapon | Victim Count | Perpetrator Count | Record Source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | January | 1 | Murder or Manslaughter | ... | Unknown | Male | 15 | Native American/Alaska Native | Unknown | Acquaintance | Blunt Object | 0 | 0 | FBI |
| 1 | 2 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 1 | Murder or Manslaughter | ... | Unknown | Male | 42 | White | Unknown | Acquaintance | Strangulation | 0 | 0 | FBI |
| 2 | 3 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 2 | Murder or Manslaughter | ... | Unknown | Unknown | 0 | Unknown | Unknown | Unknown | Unknown | 0 | 0 | FBI |
| 3 | 4 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | April | 1 | Murder or Manslaughter | ... | Unknown | Male | 42 | White | Unknown | Acquaintance | Strangulation | 0 | 0 | FBI |
| 4 | 5 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | April | 2 | Murder or Manslaughter | ... | Unknown | Unknown | 0 | Unknown | Unknown | Unknown | Unknown | 0 | 1 | FBI |
| 5 | 6 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | May | 1 | Murder or Manslaughter | ... | Unknown | Male | 36 | White | Unknown | Acquaintance | Rifle | 0 | 0 | FBI |
| 6 | 7 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | May | 2 | Murder or Manslaughter | ... | Unknown | Male | 27 | Black | Unknown | Wife | Knife | 0 | 0 | FBI |
| 7 | 8 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 1 | Murder or Manslaughter | ... | Unknown | Male | 35 | White | Unknown | Wife | Knife | 0 | 0 | FBI |
| 8 | 9 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 2 | Murder or Manslaughter | ... | Unknown | Unknown | 0 | Unknown | Unknown | Unknown | Firearm | 0 | 0 | FBI |
| 9 | 10 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 3 | Murder or Manslaughter | ... | Unknown | Male | 40 | Unknown | Unknown | Unknown | Firearm | 0 | 1 | FBI |
10 rows × 24 columns
# Checking our data rows and columns
homicides_df.shape
print(f'No. of rows: {homicides_df.shape[0]}')
print(f'No. of columns: {homicides_df.shape[1]}')
No. of rows: 638454
No. of columns: 24
# Load the US city-coordinates CSV into a dataframe so we can add location data
path = ''
locations_df = pd.read_csv(path + 'uscities.csv', usecols = [0, 3, 6, 7])
locations_df = locations_df.rename(columns = {'city': 'City', 'lat': 'Latitude', 'lng': 'Longitude', 'state_name': 'State'})
locations_df.head(10)
| City | State | Latitude | Longitude | |
|---|---|---|---|---|
| 0 | New York | New York | 40.6943 | -73.9249 |
| 1 | Los Angeles | California | 34.1141 | -118.4068 |
| 2 | Chicago | Illinois | 41.8375 | -87.6866 |
| 3 | Miami | Florida | 25.7840 | -80.2101 |
| 4 | Houston | Texas | 29.7860 | -95.3885 |
| 5 | Dallas | Texas | 32.7935 | -96.7667 |
| 6 | Philadelphia | Pennsylvania | 40.0077 | -75.1339 |
| 7 | Atlanta | Georgia | 33.7628 | -84.4220 |
| 8 | Washington | District of Columbia | 38.9047 | -77.0163 |
| 9 | Boston | Massachusetts | 42.3188 | -71.0852 |
# Dropping some columns that we will not use for later analysis
homicides_df.drop(columns=['Record Source', 'Perpetrator Race', 'Perpetrator Ethnicity'], inplace=True)
Eliminating these columns streamlines the dataset, improves computational performance, cuts down on redundancy, and concentrates the analysis on the features most important for the later phases. This deliberate, focused operation highlights the value of feature selection and simplified data for an efficient analysis process.
# checking to see if we have any duplicates
print(f'No. of duplicate rows: {homicides_df.duplicated().sum()}')
No. of duplicate rows: 0
There are no duplicated records.
# Using this to check for missing/blank values in the dataset column wise
missing_vals = homicides_df.isnull().sum()
missing_vals
Record ID            0
Agency Code          0
Agency Name          0
Agency Type          0
City                 0
State                0
Year                 0
Month                0
Incident             0
Crime Type           0
Crime Solved         0
Victim Sex           0
Victim Age           0
Victim Race          0
Victim Ethnicity     0
Perpetrator Sex      0
Perpetrator Age      0
Relationship         0
Weapon               0
Victim Count         0
Perpetrator Count    0
dtype: int64
We found no missing values coded as NaN.
While reviewing the dataframe, however, we found that 'Unknown' is used in place of NaN as the missing value. So, let's replace it.
# Find count of 'Unknown' in each column
unknown_counts = {}
for column in homicides_df.columns:
unknown_counts[column] = (homicides_df[column] == 'Unknown').sum()
print(unknown_counts)
{'Record ID': 0, 'Agency Code': 0, 'Agency Name': 47, 'Agency Type': 0, 'City': 0, 'State': 0, 'Year': 0, 'Month': 0, 'Incident': 0, 'Crime Type': 0, 'Crime Solved': 0, 'Victim Sex': 984, 'Victim Age': 0, 'Victim Race': 6676, 'Victim Ethnicity': 368303, 'Perpetrator Sex': 190365, 'Perpetrator Age': 0, 'Relationship': 273013, 'Weapon': 33192, 'Victim Count': 0, 'Perpetrator Count': 0}
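The same counts can be obtained without an explicit loop: comparing the whole frame against a value yields a boolean frame, and a column-wise sum gives the counts in one expression. A minimal sketch on toy data (the same expression applies unchanged to `homicides_df`):

```python
import pandas as pd

# Toy frame standing in for homicides_df.
df = pd.DataFrame({
    'Victim Sex': ['Male', 'Unknown', 'Female'],
    'Weapon': ['Knife', 'Unknown', 'Unknown'],
})

# Boolean frame of matches, summed per column.
unknown_counts = (df == 'Unknown').sum()
```

This vectorized form is usually faster on large frames and avoids mutating a dictionary by hand.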
# Replace 'Unknown' with NaN
homicides_df.replace('Unknown', NA, inplace=True)
homicides_df.head(10)
| Record ID | Agency Code | Agency Name | Agency Type | City | State | Year | Month | Incident | Crime Type | ... | Victim Sex | Victim Age | Victim Race | Victim Ethnicity | Perpetrator Sex | Perpetrator Age | Relationship | Weapon | Victim Count | Perpetrator Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | January | 1 | Murder or Manslaughter | ... | Male | 14 | Native American/Alaska Native | NaN | Male | 15 | Acquaintance | Blunt Object | 0 | 0 |
| 1 | 2 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 1 | Murder or Manslaughter | ... | Male | 43 | White | NaN | Male | 42 | Acquaintance | Strangulation | 0 | 0 |
| 2 | 3 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 2 | Murder or Manslaughter | ... | Female | 30 | Native American/Alaska Native | NaN | NaN | 0 | NaN | NaN | 0 | 0 |
| 3 | 4 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | April | 1 | Murder or Manslaughter | ... | Male | 43 | White | NaN | Male | 42 | Acquaintance | Strangulation | 0 | 0 |
| 4 | 5 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | April | 2 | Murder or Manslaughter | ... | Female | 30 | Native American/Alaska Native | NaN | NaN | 0 | NaN | NaN | 0 | 1 |
| 5 | 6 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | May | 1 | Murder or Manslaughter | ... | Male | 30 | White | NaN | Male | 36 | Acquaintance | Rifle | 0 | 0 |
| 6 | 7 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | May | 2 | Murder or Manslaughter | ... | Female | 42 | Native American/Alaska Native | NaN | Male | 27 | Wife | Knife | 0 | 0 |
| 7 | 8 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 1 | Murder or Manslaughter | ... | Female | 99 | White | NaN | Male | 35 | Wife | Knife | 0 | 0 |
| 8 | 9 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 2 | Murder or Manslaughter | ... | Male | 32 | White | NaN | NaN | 0 | NaN | Firearm | 0 | 0 |
| 9 | 10 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 3 | Murder or Manslaughter | ... | Male | 38 | White | NaN | Male | 40 | NaN | Firearm | 0 | 1 |
10 rows × 21 columns
We know that the Perpetrator Age column has values stored as strings. Let's convert it to a numeric type so we can perform calculations easily.
# Convert Perpetrator Age to numeric; non-numeric entries become NaN
homicides_df['Perpetrator Age'] = pd.to_numeric(homicides_df['Perpetrator Age'], errors='coerce')
We found many outliers based on age. We want to remove them for both Victim and Perpetrator.
Let's do for perpetrator first.
#Calculate the mean and standard deviation of the column
mean = homicides_df['Perpetrator Age'].mean()
std_dev = homicides_df['Perpetrator Age'].std()
# Define a threshold (for instance, outliers beyond 3 standard deviations)
threshold = 3
# Filter the DataFrame to exclude values beyond the threshold
homicides_df = homicides_df[(homicides_df['Perpetrator Age'] < (mean + threshold * std_dev)) & (homicides_df['Perpetrator Age'] > (mean - threshold * std_dev))]
homicides_df['Perpetrator Age'].describe()
count    634745.000000
mean         19.971691
std          17.332278
min           0.000000
25%           0.000000
50%          21.000000
75%          31.000000
max          73.000000
Name: Perpetrator Age, dtype: float64
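One caveat: the chained comparison used in the filter evaluates to False for NaN, so rows with a missing age are silently dropped along with the true outliers. If that is not intended, a hedged sketch of an alternative that keeps missing values (toy data standing in for the real column):

```python
import numpy as np
import pandas as pd

# Toy column: twenty plausible ages, one missing value, one extreme outlier.
ages = pd.DataFrame({'Perpetrator Age': [20.0] * 10 + [25.0] * 10 + [np.nan, 500.0]})
mean = ages['Perpetrator Age'].mean()
std_dev = ages['Perpetrator Age'].std()
threshold = 3

# z-scores computed directly; NaN propagates instead of being compared away.
z = (ages['Perpetrator Age'] - mean) / std_dev
kept = ages[(z.abs() < threshold) | z.isna()]
```

Here the extreme value is removed while the missing-age row survives for the later zero-to-NaN handling.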
Let's do it for Victim's age now.
#Calculate the mean and standard deviation of the column
mean = homicides_df['Victim Age'].mean()
std_dev = homicides_df['Victim Age'].std()
# Define a threshold (for instance, outliers beyond 3 standard deviations)
threshold = 3
# Filter the DataFrame to exclude values beyond the threshold
homicides_df = homicides_df[(homicides_df['Victim Age'] < (mean + threshold * std_dev)) & (homicides_df['Victim Age'] > (mean - threshold * std_dev))]
homicides_df['Victim Age'].describe()
count    633775.000000
mean         33.386476
std          17.624169
min           0.000000
25%          22.000000
50%          30.000000
75%          41.000000
max          99.000000
Name: Victim Age, dtype: float64
(homicides_df['Victim Age'] == 0).sum()
8442
(homicides_df['Perpetrator Age'] == 0).sum()
215687
As an age of 0 is not plausible here, we can treat it as an unknown value. So, we replace 0 with NaN.
homicides_df['Victim Age'] = np.where(homicides_df['Victim Age'] == 0, np.nan, homicides_df['Victim Age'])
homicides_df['Victim Age'].describe()
count    625333.000000
mean         33.837194
std          17.307616
min           1.000000
25%          22.000000
50%          30.000000
75%          42.000000
max          99.000000
Name: Victim Age, dtype: float64
homicides_df['Perpetrator Age'] = np.where(homicides_df['Perpetrator Age'] == 0, np.nan, homicides_df['Perpetrator Age'])
homicides_df['Perpetrator Age'].describe()
count    418088.000000
mean         30.298043
std          11.954250
min           1.000000
25%          21.000000
50%          27.000000
75%          37.000000
max          73.000000
Name: Perpetrator Age, dtype: float64
homicides_df = pd.merge(homicides_df, locations_df, on = ['City','State'], how='left')
homicides_df.head(10)
| Record ID | Agency Code | Agency Name | Agency Type | City | State | Year | Month | Incident | Crime Type | ... | Victim Race | Victim Ethnicity | Perpetrator Sex | Perpetrator Age | Relationship | Weapon | Victim Count | Perpetrator Count | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | January | 1 | Murder or Manslaughter | ... | Native American/Alaska Native | NaN | Male | 15.0 | Acquaintance | Blunt Object | 0 | 0 | 61.1508 | -149.1091 |
| 1 | 2 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 1 | Murder or Manslaughter | ... | White | NaN | Male | 42.0 | Acquaintance | Strangulation | 0 | 0 | 61.1508 | -149.1091 |
| 2 | 3 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 2 | Murder or Manslaughter | ... | Native American/Alaska Native | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 61.1508 | -149.1091 |
| 3 | 4 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | April | 1 | Murder or Manslaughter | ... | White | NaN | Male | 42.0 | Acquaintance | Strangulation | 0 | 0 | 61.1508 | -149.1091 |
| 4 | 5 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | April | 2 | Murder or Manslaughter | ... | Native American/Alaska Native | NaN | NaN | NaN | NaN | NaN | 0 | 1 | 61.1508 | -149.1091 |
| 5 | 6 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | May | 1 | Murder or Manslaughter | ... | White | NaN | Male | 36.0 | Acquaintance | Rifle | 0 | 0 | 61.1508 | -149.1091 |
| 6 | 7 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | May | 2 | Murder or Manslaughter | ... | Native American/Alaska Native | NaN | Male | 27.0 | Wife | Knife | 0 | 0 | 61.1508 | -149.1091 |
| 7 | 8 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 1 | Murder or Manslaughter | ... | White | NaN | Male | 35.0 | Wife | Knife | 0 | 0 | 61.1508 | -149.1091 |
| 8 | 9 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 2 | Murder or Manslaughter | ... | White | NaN | NaN | NaN | NaN | Firearm | 0 | 0 | 61.1508 | -149.1091 |
| 9 | 10 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 3 | Murder or Manslaughter | ... | White | NaN | Male | 40.0 | NaN | Firearm | 0 | 1 | 61.1508 | -149.1091 |
10 rows × 23 columns
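A left merge like the one above silently leaves Latitude and Longitude as NaN for any city/state pair absent from uscities.csv. pandas' `indicator` flag makes that coverage explicit; a minimal sketch with toy frames (column names mirror the real ones):

```python
import pandas as pd

# Toy stand-ins for homicides_df and locations_df.
homicides = pd.DataFrame({'City': ['Anchorage', 'Smalltown'],
                          'State': ['Alaska', 'Alaska']})
locations = pd.DataFrame({'City': ['Anchorage'], 'State': ['Alaska'],
                          'Latitude': [61.1508], 'Longitude': [-149.1091]})

# indicator=True adds a '_merge' column marking rows with no matching location.
merged = pd.merge(homicides, locations, on=['City', 'State'],
                  how='left', indicator=True)
unmatched = (merged['_merge'] == 'left_only').sum()
```

Counting `left_only` rows before mapping tells us how many homicide records will be missing from the geographic plots.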
# Setting the index as the 'Record ID' column for easier viewing
homicides_df.set_index('Record ID', inplace = True)
homicides_df.head(10)
| Agency Code | Agency Name | Agency Type | City | State | Year | Month | Incident | Crime Type | Crime Solved | ... | Victim Race | Victim Ethnicity | Perpetrator Sex | Perpetrator Age | Relationship | Weapon | Victim Count | Perpetrator Count | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Record ID | |||||||||||||||||||||
| 1 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | January | 1 | Murder or Manslaughter | Yes | ... | Native American/Alaska Native | NaN | Male | 15.0 | Acquaintance | Blunt Object | 0 | 0 | 61.1508 | -149.1091 |
| 2 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 1 | Murder or Manslaughter | Yes | ... | White | NaN | Male | 42.0 | Acquaintance | Strangulation | 0 | 0 | 61.1508 | -149.1091 |
| 3 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 2 | Murder or Manslaughter | No | ... | Native American/Alaska Native | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 61.1508 | -149.1091 |
| 4 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | April | 1 | Murder or Manslaughter | Yes | ... | White | NaN | Male | 42.0 | Acquaintance | Strangulation | 0 | 0 | 61.1508 | -149.1091 |
| 5 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | April | 2 | Murder or Manslaughter | No | ... | Native American/Alaska Native | NaN | NaN | NaN | NaN | NaN | 0 | 1 | 61.1508 | -149.1091 |
| 6 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | May | 1 | Murder or Manslaughter | Yes | ... | White | NaN | Male | 36.0 | Acquaintance | Rifle | 0 | 0 | 61.1508 | -149.1091 |
| 7 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | May | 2 | Murder or Manslaughter | Yes | ... | Native American/Alaska Native | NaN | Male | 27.0 | Wife | Knife | 0 | 0 | 61.1508 | -149.1091 |
| 8 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 1 | Murder or Manslaughter | Yes | ... | White | NaN | Male | 35.0 | Wife | Knife | 0 | 0 | 61.1508 | -149.1091 |
| 9 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 2 | Murder or Manslaughter | No | ... | White | NaN | NaN | NaN | NaN | Firearm | 0 | 0 | 61.1508 | -149.1091 |
| 10 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | June | 3 | Murder or Manslaughter | Yes | ... | White | NaN | Male | 40.0 | NaN | Firearm | 0 | 1 | 61.1508 | -149.1091 |
10 rows × 22 columns
Assigning the 'Record ID' column to the index restructures the DataFrame so that record IDs serve as the row identifiers. With a unique identifier as the index, we can retrieve individual entries directly and efficiently, and the DataFrame becomes easier to read overall.
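With 'Record ID' as the index, individual cases can be pulled with label-based `.loc` lookups; a small sketch on toy rows (same index convention as `homicides_df`):

```python
import pandas as pd

# Toy rows following the Record ID index convention.
df = pd.DataFrame({'Record ID': [1, 2, 3],
                   'Weapon': ['Blunt Object', 'Strangulation', 'Knife']})
df.set_index('Record ID', inplace=True)

case = df.loc[2]       # one record by its ID, returned as a Series
subset = df.loc[1:2]   # label slices on an index include both endpoints
```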
To get a better overview of dataset, we will do exploratory data analysis.
# Determine which columns are numerical or categorical for a better understanding of our data
def get_var_category(series):
unique_count = series.nunique(dropna=False)
total_count = len(series)
if pd.api.types.is_numeric_dtype(series):
return 'Numerical'
elif pd.api.types.is_datetime64_dtype(series):
return 'Date'
elif unique_count==total_count:
return 'Text (Unique)'
else:
return 'Categorical'
def print_categories(df):
for column_name in df.columns:
print(column_name, ": ", get_var_category(df[column_name]))
print_categories(homicides_df)
Agency Code : Categorical
Agency Name : Categorical
Agency Type : Categorical
City : Categorical
State : Categorical
Year : Numerical
Month : Categorical
Incident : Numerical
Crime Type : Categorical
Crime Solved : Categorical
Victim Sex : Categorical
Victim Age : Numerical
Victim Race : Categorical
Victim Ethnicity : Categorical
Perpetrator Sex : Categorical
Perpetrator Age : Numerical
Relationship : Categorical
Weapon : Categorical
Victim Count : Numerical
Perpetrator Count : Numerical
Latitude : Numerical
Longitude : Numerical
The code gives a thorough description of each column in the dataset by differentiating between date, categorical, and numerical variables. This classification matters because it provides the framework for customized analytical methods and well-informed decision-making: knowing whether a variable is numerical, categorical, or temporal tells us which statistical measurements, visualization approaches, and modeling tactics are suitable. Additionally, the code highlights columns whose text values are entirely unique, indicating possible identifiers or areas that need extra investigation, such as anomalies or inconsistencies in the data.
# Describe the age distribution of victims in our dataset
homicides_df['Victim Age'].describe()
count    625578.000000
mean         33.838722
std          17.308328
min           1.000000
25%          22.000000
50%          30.000000
75%          42.000000
max          99.000000
Name: Victim Age, dtype: float64
We discover that half of the victims are 30 years of age or younger, with a median age of 30 providing a measure of central tendency. The middle 50% of ages lie within the interquartile range of 22 to 42, indicating a wide variety of victim ages; after cleaning, ages run from 1 to 99. The age distribution appears to be skewed to the right, as indicated by the wider gap between the median and the third quartile than between the first quartile and the median. This study highlights both the general tendency and the diversity in victim ages, underscoring the importance of considering the larger context and any anomalies in the data in later conclusions.
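The right skew noted here can be checked directly: a right-skewed distribution has its mean pulled above the median and a wider gap between the median and upper quartile than between the lower quartile and median. A sketch on toy ages (the real column would be `homicides_df['Victim Age']`):

```python
import pandas as pd

# Toy ages standing in for the victim-age column.
ages = pd.Series([22, 25, 28, 30, 31, 35, 42, 55, 70, 99])
q1, q2, q3 = ages.quantile([0.25, 0.5, 0.75])

# Right skew: mean above the median, and Q3-Q2 wider than Q2-Q1.
right_skewed = ages.mean() > q2 and (q3 - q2) > (q2 - q1)
```

`Series.skew()` offers the same check as a single signed statistic (positive for right skew).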
# Describe the age distribution of perpetrators in our dataset
homicides_df['Perpetrator Age'].describe()
count    418303.000000
mean         30.299360
std          11.954775
min           1.000000
25%          21.000000
50%          27.000000
75%          37.000000
max          73.000000
Name: Perpetrator Age, dtype: float64
With 418,303 entries for which the perpetrator's age is known, ages range from 1 to 73. The mean age is roughly 30.3 years, indicating a comparatively youthful profile, while the standard deviation of about 11.95 suggests a good deal of variation around that mean. The quartiles point to a right-skewed distribution: 25% of perpetrators are 21 or younger, the median age is 27, and 75% are 37 or younger, giving an interquartile range of 21 to 37 years. Notably, a sizable number of records originally listed the perpetrator's age as 0; these reflect cases in which the offender's age was unknown and were treated as missing values.
Let's take a closer look at the data using some visualizations. We want to answer a variety of questions based on the data, so that we can improve our understanding of the homicide data and maybe even predict some trends. Mainly, we want to be able to view the data in a concise manner and then make some inferences based on our visualizations.
Let's get started!
What are the most commonly used weapons in homicides and why might this be the case? What are the regions or cities that have the most homicides? Why?
Let us first discuss part one of our question. What are the most common weapons used in homicides and why might this be the case?
# showing the different weapons used
weapon_count = homicides_df['Weapon'].value_counts()
weapon_count
Weapon
Handgun          315302
Knife             94696
Blunt Object      66953
Firearm           46675
Shotgun           30353
Rifle             23117
Strangulation      8072
Fire               6136
Suffocation        3914
Gun                2186
Drugs              1574
Drowning           1204
Explosives          536
Poison              440
Fall                185
Name: count, dtype: int64
Handguns are the most common weapon in these kinds of situations, suggesting that they are widely available and versatile. The use of knives and blunt items later on emphasizes how important close-quarters physical aggression is. The various ways that guns are used, including shotguns, rifles, and general purpose guns, add up to a significant number. There are also less prevalent techniques like drowning, fire, suffocation, and strangulation, all of which shed light on the various circumstances surrounding some crimes.
We want to be able to view the most common weapons used in homicides over the years and make an inference based on the visualization.
# Group weapons below the threshold into a single 'Other' slice
threshold = 10000
small_counts = weapon_count[weapon_count < threshold]
weapon_count['Other'] = small_counts.sum()
weapon_count.drop(small_counts.index, inplace=True)
plt.pie(weapon_count, labels = weapon_count.index)
plt.show()
Numerous factors may influence the observed distribution of weapon types in homicides, with handguns clearly leading the way. Handguns are a common choice in criminal activity because of their accessibility, portability, and ease of concealment, and the figures may also reflect how widely pistols are available in different neighborhoods, accounting for their prevalence in violent situations. Other weapon types, such as knives and blunt objects, appear consistently but never approach the prevalence of handguns.
Let us discuss part two of our question. What are the regions or cities that have the most homicides? Why?
We want to be able to view the cities with the highest homicide rates and then create an observation and an inference based on what we see.
city_homicide_counts = homicides_df.loc[:,['City','State','Longitude', 'Latitude']]
city_homicide_counts = pd.DataFrame({'Count': city_homicide_counts.groupby(['City','State','Longitude', 'Latitude']).size()}).reset_index()
city_homicide_counts
| City | State | Longitude | Latitude | Count | |
|---|---|---|---|---|---|
| 0 | Abbeville | South Carolina | -82.3774 | 34.1787 | 59 |
| 1 | Adair | Iowa | -94.6434 | 41.5004 | 2 |
| 2 | Adair | Oklahoma | -95.2734 | 36.4365 | 56 |
| 3 | Adams | Illinois | -91.1998 | 39.8708 | 26 |
| 4 | Adams | Nebraska | -96.5129 | 40.4572 | 18 |
| ... | ... | ... | ... | ... | ... |
| 947 | York | Pennsylvania | -76.7315 | 39.9651 | 406 |
| 948 | York | South Carolina | -81.2341 | 34.9967 | 345 |
| 949 | Yuma | Arizona | -114.5491 | 32.5995 | 204 |
| 950 | Yuma | Colorado | -102.7161 | 40.1235 | 5 |
| 951 | Zapata | Texas | -99.2612 | 26.9026 | 17 |
952 rows × 5 columns
# Bucket cities by homicide count for the bubble map
colors = ["royalblue","crimson","lightseagreen","orange"]
limits = [(0,10),(10,100),(100,200),(200,1000)]
scale = 200
fig = go.Figure()
for i in range(len(limits)):
    lim = limits[i]
    # Select the cities whose homicide count falls in this bucket
    df_sub = city_homicide_counts[(city_homicide_counts['Count'] >= lim[0]) &
                                  (city_homicide_counts['Count'] < lim[1])]
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = df_sub['Longitude'],
        lat = df_sub['Latitude'],
        text = df_sub['Count'],
        marker = dict(
            size = df_sub['Count']/scale,
            color = colors[i],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'
        ),
        name = '{0} - {1}'.format(lim[0],lim[1])))
fig.update_layout(
title_text = 'Homicides',
showlegend = True,
geo = dict(
scope = 'usa',
landcolor = 'rgb(217, 217, 217)',
)
)
fig.show()
# Grouping data by cities and counting the number of homicides in each city
city_homicide_counts = homicides_df.groupby('City')['City'].count().sort_values(ascending=False).head(10)
# Plotting the cities with the most homicides
plt.figure(figsize=(12, 6))
city_homicide_counts.plot(kind='bar', color='pink')
plt.title('Top 10 Cities with the Most Homicides')
plt.xlabel('City')
plt.ylabel('Number of Homicides')
plt.xticks(rotation=45)
plt.show()
The code groups the homicide data by city to ascertain the total number of homicides in each, and the resulting bar chart prominently displays the ten cities with the highest counts, led by Los Angeles, Chicago, and New York. This visualization gives a clear picture of the cities with high homicide rates by highlighting the spatial distribution of violent episodes.
A number of intricate socioeconomic and demographic factors may contribute to increased homicide rates in Los Angeles, Chicago, and New York City. Big cities frequently struggle with problems like gang activity, poverty, and a concentration of underprivileged neighborhoods, circumstances under which homicides and other crimes may occur at higher rates. Economic inequality and restricted access to jobs and educational opportunities in some communities may exacerbate social unrest and criminal activity. Increased population density, difficulties with law enforcement, and easier access to firearms could all contribute to an increased risk of violent incidents. The convergence of these factors in major metropolitan areas such as Los Angeles, Chicago, and New York highlights the need for targeted and multifaceted approaches to address the underlying causes of violence.
What are the sex differences between homicide perpetrators and victims and the race breakdowns of victims? How have they changed over time?
Let us first do some basic analysis on perpetrator sex versus victim sex.
# Count homicides by victim sex to see the distribution
victim_sex_count = homicides_df['Victim Sex'].value_counts()
victim_sex_count
Victim Sex
Male      492675
Female    141085
Name: count, dtype: int64
With 492,675 occurrences, the data shows that men make up a significant majority of the victims, greatly exceeding the 141,085 recorded female victims. The notable gap between the number of victims by sex raises the question of what factors are at play and warrants further research into the demographics of homicide incidents. Understanding these gender dynamics is imperative for formulating focused interventions and policies that target and prevent violence, which underscores the significance of accurate and thorough data collection in crime reporting.
# creating homicides count based on perpetrator sex to compare between the two
perpetrator_sex_count = homicides_df['Perpetrator Sex'].value_counts()
perpetrator_sex_count
Perpetrator Sex
Male      396070
Female     48220
Name: count, dtype: int64
Remarkably, male perpetrators appear in 396,070 cases, far more than the 48,220 cases with female perpetrators, who are quite uncommon. A further 190,365 cases originally carried an 'Unknown' perpetrator sex (now treated as missing), a substantial obstacle to fully understanding the gender distribution that raises concerns about reporting procedures and the constraints of data collection.
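Beyond these marginal counts, a cross-tabulation shows how victim and perpetrator sex combine within individual cases. A minimal sketch on toy data (the real call would be `pd.crosstab(homicides_df['Victim Sex'], homicides_df['Perpetrator Sex'])`):

```python
import pandas as pd

# Toy case-level data standing in for the real columns.
victim = pd.Series(['Male', 'Male', 'Female', 'Male'], name='Victim Sex')
perp = pd.Series(['Male', 'Female', 'Male', 'Male'], name='Perpetrator Sex')

# Rows: victim sex; columns: perpetrator sex; cells: number of cases.
table = pd.crosstab(victim, perp)
```

Passing `normalize='index'` would instead give, for each victim sex, the share of cases by perpetrator sex.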
We want to be able to view the breakdown between perpetrator sex and victim sex and create an inference based on the visualization.
x = homicides_df['Victim Sex'].value_counts().index
y1 = homicides_df['Victim Sex'].value_counts()
# Reindex so perpetrator counts line up with the same gender order as victims
y2 = homicides_df['Perpetrator Sex'].value_counts().reindex(x, fill_value=0)
victim_color = 'pink'
perpetrator_color = 'lavender'
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6))
for i, gender in enumerate(x):
    total_count = y1.iloc[i] + y2.iloc[i]
    victim_percentage = y1.iloc[i] / total_count * 100
    perpetrator_percentage = y2.iloc[i] / total_count * 100
    plt.bar(gender, y1.iloc[i], color=victim_color, label="Victim" if i == 0 else "")
    plt.bar(gender, y2.iloc[i], bottom=y1.iloc[i], color=perpetrator_color, label="Perpetrator" if i == 0 else "")
    plt.text(i, y1.iloc[i] / 2, f'{victim_percentage:.1f}%', ha='center', va='center', color='black', fontsize=14)
    plt.text(i, y1.iloc[i] + y2.iloc[i] / 2, f'{perpetrator_percentage:.1f}%', ha='center', va='center', color='black', fontsize=14)
plt.xlabel('Gender', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Victim and Perpetrator Counts by Gender', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.despine(top=True, right=True)
plt.legend(title="Role", fontsize=12)
plt.tight_layout()
plt.show()
The stacked bar chart shows the gender distribution of homicide victims and perpetrators. Males predominate in both roles, with far greater counts than females, which may reflect underlying criminological and social trends. Historical crime statistics have consistently shown that men are more likely than women to be involved in violent crime, both as offenders and as victims, possibly owing to socialization, cultural norms, and risk factors that are more prevalent among men. Furthermore, the notable number of cases marked 'Unknown' highlights difficulties in data collection and reporting, pointing either to gaps in the investigative process or to limits on available resources.
Next, let's do some basic analysis on victim race and make an observation.
# creating a count based on race to see who is affected the most
victim_race_count = homicides_df['Victim Race'].value_counts()
victim_race_count
Victim Race
White                            314796
Black                            298895
Asian/Pacific Islander             9831
Native American/Alaska Native      4563
Name: count, dtype: int64
According to the statistics, 314,796 victims are classified as White and 298,895 as Black, together making up the vast majority of victims. Far fewer victims are reported in the remaining categories: 9,831 Asian/Pacific Islander and 4,563 Native American/Alaska Native. This breakdown clarifies the differences in racial background among victims.
We want to be able to view the breakdown of victim race, so that we can make an inference of why certain populations are more affected.
# we would like to show the distribution of victim race
threshold = 10000
small_counts = victim_race_count[victim_race_count < threshold]
# Collapse the small categories into a single 'Other' slice
victim_race_count['Other'] = small_counts.sum()
victim_race_count.drop(small_counts.index, inplace=True)
fig, ax = plt.subplots()
ax.pie(victim_race_count, labels=victim_race_count.index,
       colors=['pink', 'lavender', 'black'], autopct='%.1f%%')
plt.show()
The victim race distribution shows racial disparities in homicide victimization, with White and Black individuals having the highest counts. Multiple variables can contribute to this distribution. Certain racial and ethnic groups, especially Black communities, may be disproportionately exposed to higher rates of crime and violence due to socioeconomic conditions and social inequality. Victimization risk may be exacerbated by historical discrimination, residential segregation, and restricted access to resources. In addition, variations in law enforcement tactics and the prevalence of gang-related violence may influence the distribution.
# Sex differences between homicide perpetrators and victims
sex_diff_df = homicides_df.groupby(['Year', 'Victim Sex', 'Perpetrator Sex']).size().unstack().fillna(0)
sex_diff_df.plot(kind='bar', stacked=True, figsize=(15, 7))
plt.title('Sex Differences Between Homicide Perpetrators and Victims Over Time')
plt.xlabel('Year, Victim Sex')
plt.ylabel('Number of Cases')
plt.show()
# Race
race_breakdown_df = homicides_df.groupby(['Year', 'Victim Race']).size().unstack().fillna(0)
race_breakdown_df.plot(kind='bar', stacked=True, figsize=(20, 7))
plt.title('Race Breakdowns of Victims Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Cases')
plt.show()
Are there times when the number of homicides significantly rises or falls, and if so, what may have caused these changes, such as policy changes, economic shifts, or adjustments to law enforcement tactics? Do periods of economic hardship or recession show a meaningful relationship with homicide frequency? Does the frequency vary by season?
First, let's take a look at the trend of homicides over the years and make an observation and inference based on our visualization.
# Group homicides by Year for analysis.
homicide_by_year = homicides_df.reset_index().groupby(by='Year')['Record ID'].count().reset_index()
homicide_by_year.rename(columns={'Record ID': 'Count'}, inplace=True)
# Define X-values and Y-values.
x_val = homicide_by_year['Year']
y_val = homicide_by_year['Count']
# Creating the chart of Years vs Count of Homicides.
fig = go.Figure()
# Add a trace for each data point to show year and count.
fig.add_trace(go.Scatter(x=x_val, y=y_val,
mode='lines+markers', marker=dict(size=8, color='blue'), line=dict(width=2),
hovertemplate='<b>Year:</b> %{x}<br><b>Count:</b> %{y}'))
# Customizing layout and style of the chart.
fig.update_layout(
title='Homicide frequency by Year',
xaxis=dict(title='Year', tickmode='linear', dtick=1),
yaxis=dict(title='Count of Homicides'),
hoverlabel=dict(bgcolor='white', font_size=12, font_family='Arial'),
font=dict(family='Arial', size=14, color='black'),
plot_bgcolor='rgba(240,240,240,0.7)',
)
# Show the chart.
fig.show()
The annual homicide frequency graph highlights a significant trend between 1990 and 2015. The graph reveals a peak in the number of homicides between 1990 and 1995, demonstrating a considerable rise in violent incidents during that time. The number of homicides then declines noticeably between 2000 and 2015, indicating a general drop in violent crime over that period.
After peaking between 1990 and 1995, the observed reduction in homicide frequency from 2000 to 2015 may stem from a combination of complex causes. Changes in law enforcement tactics and policies, such as targeted interventions in high-crime areas, increased community policing, and technological improvements in crime prevention and detection, may have played a major role. Furthermore, broader social and economic conditions may have improved: lower poverty rates, better access to education, and stronger economic prospects can all reduce crime. Community-based activities such as anti-violence campaigns and rehabilitation programs may also have helped lower the number of violent incidents.
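One way to make the decline concrete is a year-over-year percentage change on the annual counts. A minimal sketch with hypothetical yearly totals standing in for `homicide_by_year` (the numbers below are invented for illustration):

```python
import pandas as pd

# Hypothetical yearly totals standing in for homicide_by_year (invented numbers)
counts = pd.Series({1993: 24500, 1994: 23300, 1995: 21600, 1996: 19650}, name='Count')

# pct_change() gives the relative change from the previous year;
# the first year has no predecessor and comes back as NaN
yoy = counts.pct_change() * 100
print(yoy.round(1))
```

Applied to the real yearly counts, the same transformation would put a number on how steep the post-1995 decline actually was in each year.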
Next, let's take a look at the trend of homicides over certain months and make an observation and inference based on our visualization.
# Group homicides by Month for analysis.
homicide_by_month = homicides_df.reset_index().groupby(by='Month')['Record ID'].count().reset_index()
homicide_by_month.rename(columns={'Record ID': 'Count'}, inplace=True)
# Inculcate sorting mechanism for Month column and sort.
months = ["January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"]
homicide_by_month['Month'] = pd.Categorical(homicide_by_month['Month'], categories=months, ordered=True)
homicide_by_month.sort_values(by='Month', inplace=True)
# Define X-values and Y-values.
x_val = homicide_by_month['Month']
y_val = homicide_by_month['Count']
# Creating a DataFrame from month and count.
data = {'Month': x_val, 'Count': y_val}
homicide_by_month = pd.DataFrame(data)
# Creating the chart of Months vs Count of Homicides.
fig = px.scatter(homicide_by_month, x='Month', y='Count', title='Homicide frequency by Month')
# Customizing layout and style of the chart.
fig.update_layout(
xaxis=dict(title='Month', tickfont=dict(size=12, color='black')),
yaxis=dict(title='Count of Homicides', tickfont=dict(size=12, color='black')),
title=dict(text='Homicide frequency by Month', font=dict(size=24, family='Arial')),
font=dict(family='Arial', size=14, color='black'),
paper_bgcolor='rgba(255,255,255,0.7)',
plot_bgcolor='rgba(240,240,240,0.7)',
)
# Show the chart.
fig.show()
There is a clear seasonal trend in the number of homicides: they peak during the summer months, from June through September, and reach a low point in February. The decline from August to November suggests that violent incidents taper off as summer gives way to fall.
The monthly variation in the number of homicides indicates a clear trend: there are more homicides in June through September and fewer homicides in February. There could be a number of reasons for the higher rates over the summer, including more people engaging in outdoor activities, more social interactions, and possible dispute escalation. All of these things could lead to an increase in violent occurrences. Furthermore, in some areas, warmer weather is frequently associated with greater crime rates. Conversely, the colder weather and fewer outdoor activities may have contributed to February's lower homicide rates by reducing the likelihood of confrontation or criminal activity. The cyclical nature of these oscillations is further highlighted by the observed drop from August to November, which implies that there may be a dip in the elements that lead to increased violence as summer gives way to fall.
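The seasonality described above can be summarized with a simple seasonal index: each month's total divided by the all-month mean, where values above 1.0 flag above-average months. A sketch with hypothetical monthly totals standing in for `homicide_by_month` (invented numbers, not the real figures):

```python
import pandas as pd

# Hypothetical monthly totals standing in for homicide_by_month (invented numbers)
monthly = pd.Series({'February': 45000, 'June': 55000, 'July': 57500, 'September': 54000})

# A seasonal index above 1.0 marks a month that is above the all-month average
seasonal_index = monthly / monthly.mean()
print(seasonal_index.round(2))
```

On the real data, a single Series of twelve index values makes the summer surplus and the February deficit directly comparable at a glance.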
Finally, let's take a look at the trend of homicides over the different seasons and make an observation and inference based on our visualization.
# Define function for creating year-month function for Season wise analysis.
def year_month_conversion(obv):
return str(obv['Year']) + '-' + str(obv['Month'])
# Create new Year-Month column.
homicides_df['Year-Month'] = homicides_df.apply(lambda obv: str(obv['Year']) + '-' + str(obv['Month']), axis=1)
# Define function to create Season bins.
def season_map(year_month):
year, month = year_month.split('-')
if month in ['December', 'January', 'February']:
return 'Winter'
elif month in ['March', 'April', 'May']:
return 'Spring'
elif month in ['June', 'July', 'August']:
return 'Summer'
elif month in ['September', 'October', 'November']:
return 'Autumn'
else:
return 'None'
# Apply the function to create a new 'Season' column
homicides_df['Season'] = homicides_df['Year-Month'].apply(season_map)
# Group homicides by Season for analysis.
homicide_by_season = homicides_df.reset_index().groupby(by='Season')['Record ID'].count().reset_index()
homicide_by_season.rename(columns={'Record ID': 'Count'}, inplace=True)
# Define X-values and Y-values.
x_val = homicide_by_season['Season']
y_val = homicide_by_season['Count']
# Creating a DataFrame from seasons and count.
data = {'Season': x_val,
'Count': y_val}
chart_df = pd.DataFrame(data)
# Creating the chart of Seasons vs Count of Homicides.
fig = px.box(chart_df, x='Season', y='Count', points='all', title='Homicide frequency by Season')
# Customizing layout and style of the chart.
fig.update_layout(
title=dict(text='Homicide frequency by Season', font=dict(size=24, family='Arial')),
xaxis=dict(title='Season', tickfont=dict(size=12, color='black')),
yaxis=dict(title='Count of Homicides', tickfont=dict(size=12, color='black')),
font=dict(family='Arial', size=14, color='black'),
paper_bgcolor='rgba(255,255,255,0.7)',
plot_bgcolor='rgba(240,240,240,0.7)',
)
# Update the traces in the figure to customize box plot points.
fig.update_traces(
boxpoints='all',
jitter=0.5,
marker=dict(color='rgb(128, 0, 128)', opacity=0.8)
)
# Show the chart.
fig.show()
An observation that we can make here is that the highest number of homicides occur in the summer and the lowest number of homicides occur in the winter.
Why might this be the case? There might be more homicides that occur in the summer because of the temperature difference. Better weather might spur people to be outside more often and perpetrators have an easier time finding victims. Another reason could be that perpetrators have more free time during the summer, giving them ample time and freedom to commit these crimes.
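Whether the seasonal imbalance is larger than chance would allow can be checked with a chi-square goodness-of-fit test against a uniform spread across the four seasons. A sketch with hypothetical seasonal totals (invented, not the real figures):

```python
from scipy.stats import chisquare

# Hypothetical seasonal totals (Winter, Spring, Summer, Autumn); invented numbers
observed = [150000, 160000, 175000, 148000]

# Null hypothesis: homicides are spread evenly across the four seasons;
# chisquare() defaults to uniform expected frequencies
stat, p = chisquare(observed)
print(round(stat, 1), p < 0.05)
```

With counts this large even modest seasonal differences reject uniformity, so on the real data the more interesting quantity is the size of the summer excess, not just the p-value.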
Correlation between perpetrator age and gender vs. victim age and gender: which generation is most affected by homicides? Is there a generation gap between the victim and the perpetrator?
First, let's take a look at the correlation between victim age and perpetrator age and make an observation.
# Correlation between victim age and perpetrator age (both columns are numeric,
# so no dummy encoding is needed)
correlation_age = homicides_df['Victim Age'].corr(homicides_df['Perpetrator Age'])
print(f"Correlation between Victim Age and Perpetrator Age: {correlation_age}")
Correlation between Victim Age and Perpetrator Age: 0.32134999505895767
dummy_df_sex = pd.get_dummies(homicides_df[['Victim Sex', 'Perpetrator Sex']])
correlation_sex = dummy_df_sex.corr()
# Create dummy variables for Victim Age and Perpetrator Age
dummy_df_age = pd.get_dummies(homicides_df[['Victim Age', 'Perpetrator Age']])
correlation_age = dummy_df_age.corr()
# Create a matrix of both correlations
correlation_matrix = pd.concat([correlation_sex, correlation_age], keys=['Sex', 'Age'])
print("Correlation Matrix:")
print(correlation_matrix)
Correlation Matrix:
Victim Sex_Female Victim Sex_Male \
Sex Victim Sex_Female 1.000000 -0.998798
Victim Sex_Male -0.998798 1.000000
Perpetrator Sex_Female 0.000628 -0.000684
Perpetrator Sex_Male 0.071369 -0.070751
Age Victim Age NaN NaN
Perpetrator Age NaN NaN
Perpetrator Sex_Female Perpetrator Sex_Male \
Sex Victim Sex_Female 0.000628 0.071369
Victim Sex_Male -0.000684 -0.070751
Perpetrator Sex_Female 1.000000 -0.370149
Perpetrator Sex_Male -0.370149 1.000000
Age Victim Age NaN NaN
Perpetrator Age NaN NaN
Victim Age Perpetrator Age
Sex Victim Sex_Female NaN NaN
Victim Sex_Male NaN NaN
Perpetrator Sex_Female NaN NaN
Perpetrator Sex_Male NaN NaN
Age Victim Age 1.00000 0.32135
Perpetrator Age 0.32135 1.00000
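As an aside, correlating two one-hot (dummy) sex columns, as in the matrix above, is mathematically the phi coefficient of the underlying 2x2 contingency table. A small sketch on toy victim/perpetrator pairs (illustrative only) shows the two computations agree:

```python
import numpy as np
import pandas as pd

# Toy victim/perpetrator sex pairs (illustrative only)
pairs = pd.DataFrame({
    'Victim Sex':      ['Male', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'Perpetrator Sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Male'],
})

# Phi coefficient straight from the 2x2 contingency table
tab = pd.crosstab(pairs['Victim Sex'], pairs['Perpetrator Sex']).to_numpy()
a, b, c, d = tab[0, 0], tab[0, 1], tab[1, 0], tab[1, 1]
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# The same value falls out of correlating the one-hot (dummy) columns
dummies = pd.get_dummies(pairs).astype(int)
r = dummies['Victim Sex_Female'].corr(dummies['Perpetrator Sex_Female'])
print(round(float(phi), 4), round(float(r), 4))
```

This is why the off-diagonal sex entries in the matrix above can be read as association strengths between binary variables rather than as ordinary Pearson correlations of continuous data.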
Next, let's see which generation is affected the most by binning victim ages into generational groups.
bins = [0, 18, 35, 55, 75, 100]
labels = ['Gen Z', 'Millennials', 'Gen X', 'Baby Boomers', 'Silent Generation']
homicides_df['Age Group'] = pd.cut(homicides_df['Victim Age'], bins=bins, labels=labels, right=False)
age_group_counts = homicides_df['Age Group'].value_counts()
A bar plot makes the comparison between the groups easy to see.
plt.bar(age_group_counts.index, age_group_counts.values)
plt.title('Homicides by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Number of Homicides')
plt.show()
Is there any generation gap between the victim and the perpetrator?
homicides_df['Perpetrator Age'] = pd.to_numeric(homicides_df['Perpetrator Age'], errors='coerce')
homicides_df['Age Difference'] = homicides_df['Perpetrator Age'] - homicides_df['Victim Age']
homicides_df['Age Difference'] = scipy.stats.mstats.winsorize(homicides_df['Age Difference'], limits=[0.01, 0.01])
plt.hist(homicides_df['Age Difference'].dropna(), bins=20, edgecolor='black')
plt.title('Age Difference Between Victim and Perpetrator')
plt.xlabel('Age Difference')
plt.ylabel('Frequency')
plt.show()
The dataset shows a moderate positive relationship between victim and perpetrator ages, with a correlation coefficient of about 0.32. The positive sign means that as the victim's age increases, the perpetrator's age tends to increase as well. The modest size of the coefficient indicates that other factors likely play a larger role in shaping the relationship between victim and perpetrator ages, and it is important to remember that correlation does not imply causation.
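Because placeholder ages and extreme outliers can distort Pearson's r (the reason winsorizing was applied above), a rank-based Spearman correlation is a useful robustness check. A sketch on toy age pairs (illustrative values, not the real columns):

```python
import pandas as pd

# Toy age pairs (illustrative only; the real columns live in homicides_df)
ages = pd.DataFrame({
    'Victim Age':      [22, 30, 45, 19, 60, 35],
    'Perpetrator Age': [25, 28, 50, 21, 55, 90],  # note the one extreme age
})

# Spearman works on ranks, so a single extreme value distorts it far less
pearson = ages['Victim Age'].corr(ages['Perpetrator Age'])
spearman = ages['Victim Age'].corr(ages['Perpetrator Age'], method='spearman')
print(round(pearson, 3), round(spearman, 3))
```

If the Spearman and Pearson values on the real columns diverge sharply, that is a hint the linear coefficient is being dragged around by a few extreme or miscoded ages.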
For the crimes that were solved, which agency types were the most effective? Which states’ agencies were the best at solving the crimes? Which were the worst?
# Listing all the agency type
homicides_df['Agency Type'].value_counts()
# Creating a data frame that counting case solved or case not solved by the agency type
crime_solved_by_agency = homicides_df.groupby('Agency Type')['Crime Solved'].value_counts().unstack()
crime_solved_by_agency
| Crime Solved | No | Yes |
|---|---|---|
| Agency Type | ||
| County Police | 7471 | 15035 |
| Municipal Police | 156648 | 333447 |
| Regional Police | 49 | 179 |
| Sheriff | 22143 | 81972 |
| Special Police | 825 | 2028 |
| State Police | 2510 | 11657 |
| Tribal Police | 4 | 56 |
data = dict(
    character=crime_solved_by_agency.index,
    # One parent entry per agency type, however many there are
    parent=['Agency Type'] * len(crime_solved_by_agency),
    value=crime_solved_by_agency['Yes'],
)
fig = px.sunburst(
data,
names='character',
parents='parent',
values='value',
)
fig.show()
# Creating a data frame that counts cases solved or cases not solved by the state
crime_solved_by_agency = homicides_df.groupby('State')['Crime Solved'].value_counts().unstack()
crime_solved_by_agency.head()
| Crime Solved | No | Yes |
|---|---|---|
| State | ||
| Alabama | 2338 | 8871 |
| Alaska | 296 | 1316 |
| Arizona | 3617 | 9106 |
| Arkansas | 1099 | 5765 |
| California | 36326 | 62902 |
# Sum up the total case numbers for each state
crime_solved_by_agency['Total cases'] = crime_solved_by_agency[['No','Yes']].sum(axis=1)
crime_solved_by_agency.head()
| Crime Solved | No | Yes | Total cases |
|---|---|---|---|
| State | |||
| Alabama | 2338 | 8871 | 11209 |
| Alaska | 296 | 1316 | 1612 |
| Arizona | 3617 | 9106 | 12723 |
| Arkansas | 1099 | 5765 | 6864 |
| California | 36326 | 62902 | 99228 |
# Calculating the percentage for case solved for each state
solved_percentage = crime_solved_by_agency['Yes']/crime_solved_by_agency['Total cases']*100
print(solved_percentage.sort_values(ascending=False).head(10))
print('\n\n')
print(solved_percentage.sort_values(ascending=True).head(10))
State
North Dakota      93.114754
Montana           92.736486
South Dakota      92.093023
South Carolina    90.720317
Idaho             90.299824
Wyoming           90.192926
West Virginia     89.505148
Maine             89.277389
Vermont           88.059701
Iowa              87.089337
dtype: float64

State
District of Columbia    34.225352
New York                54.059123
Maryland                59.420122
Illinois                61.140419
Massachusetts           63.254418
California              63.391381
Missouri                63.702347
New Jersey              63.931624
Connecticut             66.392262
Michigan                66.723296
dtype: float64
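One caveat when ranking states this way: states with very few cases can post extreme percentages by chance. A sketch of a guard against that, using three state rows taken from the tables above and a hypothetical `min_cases` threshold:

```python
import pandas as pd

# Three state rows taken from the tables above; min_cases is a hypothetical cutoff
states = pd.DataFrame(
    {'No': [296, 36326, 2338], 'Yes': [1316, 62902, 8871]},
    index=['Alaska', 'California', 'Alabama'],
)

states['Total cases'] = states['No'] + states['Yes']
states['Solved %'] = states['Yes'] / states['Total cases'] * 100

# Only rank states with enough cases, so tiny caseloads cannot dominate the list
min_cases = 5000
ranked = states[states['Total cases'] >= min_cases].sort_values('Solved %', ascending=False)
print(ranked[['Solved %']].round(1))
```

Notably, several of the top-ranked states above (North Dakota, Wyoming, Vermont) are low-population states, so a volume-aware ranking is a sensible robustness check.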
What kind of weapons did certain age groups prefer to use? What kinds of weapons did perpetrators of each sex use?
age_bins = [1, 10, 20, 30, 40, 50, 60, 70, 80]
age_labels = ['1-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']
homicides_df['Perpetrator Age Group'] = pd.cut(homicides_df['Perpetrator Age'], bins=age_bins, labels=age_labels, right=False)
weapon_counts = homicides_df.groupby(['Perpetrator Age Group', 'Weapon']).size().unstack(fill_value=0)
weapon_by_age = weapon_counts.idxmax(axis=1)
weapon_counts_age = weapon_counts.max(axis=1)
weapon_age_df = pd.DataFrame({'Age Group':weapon_by_age.index, 'Weapon':weapon_by_age.values, 'Count': weapon_counts_age.values})
print(weapon_age_df.sort_values(by='Count', ascending=False))
  Age Group   Weapon  Count
2     21-30  Handgun  79002
3     31-40  Handgun  41159
1     11-20  Handgun  38340
4     41-50  Handgun  21808
5     51-60  Handgun  10985
6     61-70  Handgun   5127
7     71-80  Handgun   1185
0      1-10  Handgun    132
#created horizontal bar chart
colors = sns.color_palette("viridis", len(weapon_age_df))
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x='Count', y='Age Group', data=weapon_age_df, palette=colors)
plt.xlabel('Count')
plt.ylabel('Age Group')
plt.title('Weapon Counts by Age Group')
for index, value in enumerate(weapon_age_df['Count']):
bar_plot.text(value, index, f'{value:,}', ha='left', va='center', color='black')
plt.show()
homicides_df['Perpetrator Sex'].unique()
weapon_counts_sex = homicides_df.groupby(['Perpetrator Sex', 'Weapon']).size()
weapon_counts_df = weapon_counts_sex.reset_index(name='Count')
plt.figure(figsize=(14, 10))
sns.barplot(x='Weapon', y='Count', hue='Perpetrator Sex', data=weapon_counts_df, palette='deep')
plt.title('Weapon Counts by Perpetrator Sex')
plt.xlabel('Weapon')
plt.ylabel('Count')
plt.xticks(rotation=30, ha='right')
plt.legend(title='Perpetrator Sex')
plt.show()
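Because male perpetrators outnumber female perpetrators severalfold, raw counts hide whether the two groups actually prefer different weapons. Normalizing within each sex makes the comparison fair; a sketch on toy data (illustrative pairs, not the real columns):

```python
import pandas as pd

# Toy perpetrator-sex / weapon pairs (illustrative only)
weapons = pd.DataFrame({
    'Perpetrator Sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'Weapon':          ['Handgun', 'Knife', 'Handgun', 'Knife', 'Knife', 'Handgun'],
})

# normalize='index' converts each row to within-sex proportions,
# so the much larger male caseload no longer dwarfs the female bars
props = pd.crosstab(weapons['Perpetrator Sex'], weapons['Weapon'], normalize='index')
print(props.round(2))
```

Plotting these proportions instead of raw counts would answer the "preference" question directly, whereas the grouped count chart above mostly reflects the overall sex imbalance.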
What makes some areas more likely to have homicides, and can we use predictive analysis to find out which neighborhoods or city blocks are most at risk?
# Engineer feature for the target variable
homicides_df['HighRisk'] = homicides_df['Victim Count'].apply(lambda x: 1 if x > 0 else 0)
# Select features
features = ['City', 'Weapon', 'Relationship', 'Victim Age', 'Perpetrator Age', 'Victim Race']
# Encode categorical columns
category_cols = ['City', 'Weapon', 'Relationship', 'Victim Race']
encoders = {} # to store encoders for later use
for f in category_cols:
    encoder = LabelEncoder()
    homicides_df[f] = encoder.fit_transform(homicides_df[f])
    encoders[f] = encoder
# Split data
X_train, X_test, y_train, y_test = train_test_split(homicides_df[features], homicides_df['HighRisk'], test_size=0.2, random_state=0)
# Impute missing values
imputer = SimpleImputer(strategy='mean') # You can choose a different strategy
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
# Train model (random forest used here)
model = RandomForestClassifier(random_state=0)
model.fit(X_train_imputed, y_train)
# Evaluate predictions
preds = model.predict(X_test_imputed)
# Print evaluation metrics
accuracy = accuracy_score(y_test, preds)
print('Accuracy:', accuracy)
Accuracy: 0.9210125783683609
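An accuracy of 0.92 should be read against the class balance: if most records have `HighRisk == 1`, a model that always predicts the majority class scores nearly as well. A sketch of that baseline on hypothetical, deliberately imbalanced labels (invented, not the real `y_test`):

```python
import numpy as np

# Hypothetical, deliberately imbalanced test labels standing in for y_test
y_demo = np.array([1] * 92 + [0] * 8)

# A classifier that always predicts the majority class already scores this well;
# a real model only adds value if it clearly beats this number
baseline = max(np.mean(y_demo), 1 - np.mean(y_demo))
print(baseline)
```

Comparing the model's accuracy to this majority-class baseline on the actual test split (along with the confusion matrix below) tells us how much signal the features really contribute.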
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Generate confusion matrix
cm = confusion_matrix(y_test, preds)
# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Low Risk', 'High Risk'], yticklabels=['Low Risk', 'High Risk'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns
# Get feature importances
feature_importances = model.feature_importances_
feature_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
feature_df = feature_df.sort_values(by='Importance', ascending=False)
# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_df, palette='viridis')
plt.title('Feature Importance')
plt.show()
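Impurity-based feature importances from a random forest can overstate high-cardinality encoded features such as `City`. Permutation importance is a common cross-check; below is a self-contained sketch on synthetic data (two invented features, one informative and one pure noise), not the project's actual model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: feature 0 drives the label, feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Permutation importance = accuracy drop when one column is shuffled,
# which is less biased toward high-cardinality features than impurity importance
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))
```

Running the same check on the held-out test split of the homicide model would confirm whether `City` and `Relationship` genuinely carry predictive signal or are merely favored by the impurity criterion.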
This project provides a detailed analysis of homicides, revealing critical trends and patterns that contribute to understanding the socio-economic and demographic dynamics influencing these incidents. The data visualization and statistical techniques applied have successfully highlighted significant factors contributing to homicide rates, including regional disparities, temporal patterns, and underlying societal influences.
The insights gained from this study can serve as a foundation for further research and policy formulation aimed at reducing homicide rates. By addressing key factors such as poverty, education, and law enforcement efficiency, stakeholders can develop targeted interventions to foster safer communities.
This analysis underscores the importance of data-driven approaches in tackling complex societal challenges and emphasizes the need for continuous data collection and analysis to monitor progress and adapt strategies accordingly.